l1: add replicated database #29161
Conversation
Force-pushed from 2a46d7c to 37e3199
Pull request overview
This PR introduces a new replicated_database abstraction that wraps an LSM database with Raft-based replication. The database is leader-only and uses object storage for data persistence while replicating its manifest through Raft for fault tolerance.
Key changes:
- Added timeout support to lsm::database::flush() to prevent indefinite blocking
- Introduced memory_persistence_controller for testing failure scenarios
- Implemented the replicated_database class that coordinates LSM operations with Raft replication (a hedged interface sketch follows below)
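For orientation, the following is a minimal C++ sketch of what such a leader-only wrapper could look like. The class layout, member names, and signatures are illustrative assumptions, not the declarations from this PR; iobuf and lsm::database are the project's existing types.

```cpp
// Illustrative sketch only: names, types, and signatures are assumptions,
// not the interface introduced by this PR. iobuf and lsm::database are the
// project's existing types and are not declared here.
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>

#include <chrono>
#include <memory>
#include <optional>

class replicated_database {
public:
    // Open on the current leader: load the manifest replicated through the
    // Raft STM, then replay any replicated-but-unpersisted (volatile) writes
    // so the opened view matches the tip of the committed log.
    ss::future<> open();

    // Writes are replicated through Raft before being applied to the LSM.
    ss::future<> put(iobuf key, iobuf value);

    // Flush the LSM; the optional timeout bounds how long we wait on the
    // metadata persistence layer instead of retrying forever.
    ss::future<> flush(std::optional<std::chrono::milliseconds> timeout);

private:
    std::unique_ptr<lsm::database> _db; // data and metadata in object storage
    ss::abort_source _as;               // cancels in-flight operations
};
```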
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/v/lsm/lsm.h | Added optional timeout parameter to flush method signature |
| src/v/lsm/lsm.cc | Implemented timeout parameter forwarding in flush wrapper |
| src/v/lsm/io/memory_persistence.h | Added controller struct for injecting failures in tests |
| src/v/lsm/io/memory_persistence.cc | Implemented failure injection logic in memory persistence |
| src/v/lsm/db/tests/impl_test.cc | Added test for flush timeout behavior |
| src/v/lsm/db/impl.h | Updated flush signature with timeout parameter |
| src/v/lsm/db/impl.cc | Implemented timeout enforcement in flush operation |
| src/v/cloud_topics/level_one/metastore/lsm/tests/replicated_db_test.cc | Comprehensive test suite for replicated database functionality |
| src/v/cloud_topics/level_one/metastore/lsm/tests/BUILD | Build configuration for new test |
| src/v/cloud_topics/level_one/metastore/lsm/replicated_persistence.h | Interface for Raft-replicated metadata persistence |
| src/v/cloud_topics/level_one/metastore/lsm/replicated_persistence.cc | Implementation of replicated metadata persistence |
| src/v/cloud_topics/level_one/metastore/lsm/replicated_db.h | Header for replicated database abstraction |
| src/v/cloud_topics/level_one/metastore/lsm/replicated_db.cc | Core implementation of replicated database operations |
| src/v/cloud_topics/level_one/metastore/lsm/BUILD | Build configuration for new libraries |
Force-pushed from c6676df to a69fd10
rockwotj left a comment
Nice, a quick glance mostly LGTM. Will look more next week when I'm back at a computer.
read_manifest(lsm::internal::database_epoch max_epoch) override {
    _as.check();
    auto _ = _gate.hold();
    auto term_result = co_await _stm->sync(std::chrono::seconds(30));
do we need to make all this abortable too? Maybe the io layer needs an abort source in the apis... Anyways not for this PR
Is it supposed to be invoked right after the leadership transfer?
do we need to make all this abortable too?
Done
Is it supposed to be invoked right after the leadership transfer?
Yeah, the expectation is that this is called when opening the database, before performing any updates on the database in a given term.
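As a hedged illustration of that ordering (the hook name below is hypothetical, and replicated_database::open() is assumed to do the manifest load plus volatile-buffer replay):

```cpp
// Hypothetical leadership hook, for illustration only.
ss::future<> on_became_leader(replicated_database& db) {
    // Open from the replicated manifest and replay the volatile buffer
    // before performing any updates in this term.
    co_await db.open();
    // Only after open() completes do we start serving writes for the term.
}
```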
cloud_io::remote* remote,
const cloud_storage_clients::bucket_name& bucket,
ss::abort_source& as) {
    auto term_result = co_await s->sync(std::chrono::seconds(30));
Same question: is it expected to be invoked right after the leadership transfer, or at start?
This is expected to be called upon becoming leader before replicating any LSM updates in the given term (hence all LSM updates go through an already opened replicated_database instance)
// Replay the writes in the volatile_buffer as writes to the database.
// These are writes that were replicated but not yet persisted to the
// manifest.
auto max_persisted_seqno = db.max_persisted_seqno();
Do I understand correctly that this replay is not the same as the STM log replay? Here we're applying batches which are already stored by the STM (in other words they're applied to the STM but not to the LSM).
That's correct: STM log replay gets us the Raft-replicated entries of the volatile buffer that have not yet been persisted in the LSM manifest. The replay here applies those write batches on top, so the opened database is caught up to the tip of the committed log.
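A rough sketch of that catch-up step, assuming the volatile buffer is exposed as a sequence of (seqno, row) entries; apart from max_persisted_seqno(), which appears in the diff above, the names are hypothetical:

```cpp
// Sketch only: volatile_row, write_batch_row, and db.put() are assumed
// shapes, not the PR's actual API.
ss::future<> replay_volatile_buffer(
  lsm::database& db, const std::vector<volatile_row>& buffer) {
    // Rows at or below the manifest's max persisted seqno are already
    // reflected in the opened database; only newer rows are re-applied.
    auto max_persisted = db.max_persisted_seqno();
    for (const auto& entry : buffer) {
        if (entry.seqno <= max_persisted) {
            continue;
        }
        co_await db.put(entry.row.key, entry.row.value.copy());
    }
}
```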
Force-pushed from f14503f to cfd5991
CI test results on build #78915
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
.row
  = write_batch_row{.key = "key_before_reset", .value = iobuf::from("value_before_reset"),},
Copilot AI (Jan 12, 2026)
The line formatting breaks the designated initializer on a single line by placing the closing brace and comma separately. This should be reformatted to either fit on one line or break consistently across multiple lines for better readability.
Suggested change:
.row = write_batch_row{
  .key = "key_before_reset",
  .value = iobuf::from("value_before_reset"),
},
volatile_row{
  .seqno = lsm::sequence_number{100},
  .row
  = write_batch_row{.key = "reset_key", .value = iobuf::from("reset_value"),},
Copilot AI (Jan 12, 2026)
The line formatting breaks the designated initializer on a single line by placing the closing brace and comma separately. This should be reformatted to either fit on one line or break consistently across multiple lines for better readability.
volatile_row{
  .seqno = lsm::sequence_number{100},
  .row
  = write_batch_row{.key = "reset_key", .value = iobuf::from("reset_value"),},
Copilot AI (Jan 12, 2026)
The line formatting breaks the designated initializer on a single line by placing the closing brace and comma separately. This should be reformatted to either fit on one line or break consistently across multiple lines for better readability.
Plumbs a new struct into memory persistence to allow tests to fail operations. In the future this can be used to inject delays, randomized failures, etc.
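A minimal sketch of such a test-only failure knob, with field and method names that are assumptions rather than the ones added by this commit:

```cpp
#include <cstddef>
#include <optional>
#include <system_error>

// Hedged sketch of a failure-injection controller for memory persistence.
struct memory_persistence_controller {
    // When failures_remaining > 0, operations fail with injected_error.
    std::optional<std::error_code> injected_error;
    size_t failures_remaining{0};

    // Called by the memory persistence layer before each operation; returns
    // the error to surface, or nullopt to let the operation proceed.
    std::optional<std::error_code> maybe_fail() {
        if (injected_error && failures_remaining > 0) {
            --failures_remaining;
            return injected_error;
        }
        return std::nullopt;
    }
};
```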
Force-pushed from cfd5991 to ab8c3cd
Force push to rebase on dev
In case of errors in the metadata persistence layer, flush would previously hang until success. This adds an optional timeout for this case, which will be useful for an upcoming metadata persistence layer that uses Raft.
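A hedged sketch of what such a timeout-bounded flush loop could look like; try_persist_manifest() and retry_backoff() are hypothetical helpers standing in for the real persistence call, and the exact signature may differ:

```cpp
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/timed_out_error.hh>

// Illustrative only, not the PR's implementation.
ss::future<>
database::flush(std::optional<ss::lowres_clock::duration> timeout) {
    std::optional<ss::lowres_clock::time_point> deadline;
    if (timeout) {
        deadline = ss::lowres_clock::now() + *timeout;
    }
    while (true) {
        auto ec = co_await try_persist_manifest();
        if (!ec) {
            co_return; // manifest persisted successfully
        }
        if (deadline && ss::lowres_clock::now() >= *deadline) {
            // Previously this loop retried until success; with a timeout we
            // give up and surface the failure to the caller.
            throw ss::timed_out_error{};
        }
        co_await retry_backoff();
    }
}
```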
Introduces a wrapper around cloud_persistence that replicates and serves the database manifest from Raft (while maintaining it in object storage as well). A subsequent commit will introduce usage of this to maintain a database across replicas of a Raft group.
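Piecing together the snippets quoted in the review threads above, the manifest read path looks roughly like the following; the glue around the quoted lines (error handling, exact types) is an assumption:

```cpp
// Reconstructed sketch around the lines quoted in the review; the error
// handling shown here is a guess, not the PR's actual behavior.
ss::future<std::optional<iobuf>>
read_manifest(lsm::internal::database_epoch max_epoch) {
    _as.check();
    auto _ = _gate.hold();
    // Sync the STM so everything committed up to the current term has been
    // applied before the manifest is read out of its state.
    auto term_result = co_await _stm->sync(std::chrono::seconds(30));
    if (!term_result) {
        throw std::runtime_error("failed to sync stm before manifest read");
    }
    if (!_stm->state().persisted_manifest) {
        // There is no persisted manifest yet.
        co_return std::nullopt;
    }
    co_return _stm->state().persisted_manifest->buf.copy();
}
```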
Introduces a class that wraps lsm::database with the appropriate object storage classes to be consistent across replica leaders (i.e. different instances see a consistent view of the database upon leadership changes).
Force-pushed from ab8c3cd to 1398d70
ss::future<std::optional<iobuf>>
read_manifest(lsm::internal::database_epoch max_epoch) override {
    _as.check();
btw I think this is kind of useless because we never call close until after all callers of this method have returned. Doesn't need to block this PR, we can fix in a followup (I want to integrate @nvartolomei's context thing into the LSM)
    // There is no persisted manifest.
    co_return std::nullopt;
}
co_return _stm->state().persisted_manifest->buf.copy();
probably could share this out too
Will put out a follow-up (and some shares missed in the other PR)
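As a hedged illustration of that follow-up (assuming iobuf exposes share(pos, len) and size_bytes(), and that the STM state allows sharing here):

```cpp
// Hypothetical follow-up: share the manifest buffer's fragments instead of
// copying them. share()/size_bytes() availability is an assumption here.
auto& manifest = _stm->state().persisted_manifest->buf;
co_return manifest.share(0, manifest.size_bytes());
```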
Builds on top of the new LSM STM and introduces a new replicated_database abstraction that is intended to be opened only on leaders. It is an lsm::database whose data and metadata storage are backed by object storage for recoverability, and whose manifest is replicated through Raft.

After opening the database from the serialized manifest in the STM, leaders are expected to apply the remaining write batches from the volatile buffer before serving subsequent requests. This expectation is encoded in the replicated_database::open() call.

Backports Required
Release Notes